FOREWORD


Determination  of  sample  size  is  a  problem  that  has  both  practical  and 
statistical  implications  for  gunnery  performance  research.  This  report 
discusses  these  implications  and  provides  a  power  analysis  technique  for 
calculating the minimum detectable difference (MDD) between two independent
samples. This technique was used to calculate a table of MDDs for typical
sample sizes found in gunnery research. Researchers can use this table to
make tradeoffs between sample sizes and MDD.

This research is part of the U.S. Army Research Institute for the
Behavioral and Social Sciences (ARI) task entitled "Application of
Technology to Meet Armor Skills Training Needs." That task is performed
under the auspices of ARI's Armor Research and Development Activity at
Fort Knox, whose mission includes designing and executing human
performance research in armor gunnery. The results presented in this
report were briefed to the Commanding General and Staff of the U.S. Army
Armor Center (USAARMC) for consideration in developing future crew and
platoon qualification tables. The methods outlined in this report are
being used by the Directorate of Evaluation and Standardization to
determine the sample size requirements of their evaluations. Finally,
the power analysis techniques were used in a companion paper entitled
"Description and Prediction of Grafenwoehr M1 Tank Table VIII
Performance" to determine distribution effects that are required for
significant differences on Table VIII type gunnery data.

The proponent for this research is the Training and Doctrine Command
(TRADOC), and the user is the USAARMC (Letter of Agreement with ARI entitled
"Establishment of Training Technology Field Activity, Ft. Knox, Kentucky,"
dated 4 November 1983). Access to some of the data sources was provided by
Mr. Al Pomey of the U.S. Army Armor and Engineer Board.


EDGAR  M.  JOHNSON 
Technical  Director 
POWER ANALYSIS OF GUNNERY PERFORMANCE MEASURES: DIFFERENCES BETWEEN MEANS OF
TWO  INDEPENDENT  GROUPS 


EXECUTIVE  SUMMARY 


Requirement:

Determination of sample size (N) is a problem that has both practical and
statistical implications for gunnery performance research. The purpose of
this research was to make the techniques of power analysis more accessible for
the gunnery researcher so that he can make informed decisions about sample
size.


Procedure:

Performance variability estimates were obtained from gunnery performance
on Table VIII qualifications at Grafenwöhr and from published research on
U-COFT. These data were used in examples to describe power analysis
procedures developed by Welkowitz, Ewen, and Cohen (1982) for determining
power and sample size.


Findings:

Estimates of standard deviations were obtained on four measures taken
from Table VIII and seven measures from U-COFT research. For the measures
that were common to both media (opening time, percent first round hits, and
percent hits), the estimates were remarkably similar. Using a variant of the
power analysis procedures, these data were used to calculate minimum
detectable differences (MDDs) between independent groups of crews using a
two-tailed test of significance given the standard significance criterion
of .05 and power of .80. The most notable finding from this analysis was
that statistical comparisons of company-sized samples (i.e., N = 14) are
insensitive to differences in speed and accuracy of gunnery performance.


Utilization  of  Findings: 

The  advantage  to  using  the  table  of  MDDs  provided  in  this  report  is  that 
the  researcher  does  not  have  to  determine  a  difference  between  means  a  priori. 
He  can  instead  propose  a  performance  measure  and  sample  size  and  see  if  the 
value  of  the  MDD  is  "reasonable"  for  his  needs.  The  table  also  permits  the 
researcher  to  make  tradeoffs  between  sample  size  and  detectable  difference. 
POWER ANALYSIS OF GUNNERY PERFORMANCE MEASURES:
DIFFERENCES  BETWEEN  MEANS  OF  TWO  INDEPENDENT  GROUPS 

INTRODUCTION 


Problem 


Determination  of  sample  size  (N)  is  a  problem  that  has  both  practical 
and  statistical  implications  for  gunnery  performance  research.  Samples 
that  are  too  large  are  clearly  wasteful  of  manpower  and  equipment 
resources.  On  the  other  hand,  samples  that  are  too  small  may  be  invalid 
for  parametric  statistical  analysis.  With  regard  to  the  latter  point, 
statisticians  caution  that  samples  should  be  large  enough  that  the  normal 
distribution  provides  a  close  approximation  for  the  sampling  distribution 
of means. That value of N is generally regarded as 30, which is also used
as a common break point between 'small' and 'large' samples. However,
Hays (1963) stated that sample sizes as small as 10 may be large enough
that the sampling distribution of means is sufficiently approximated by
the normal distribution. Indeed, a casual perusal of the published
research literature indicates that Ns as small as 10-12 are not uncommon.

Whereas statistical comparisons based on Ns as small as 10 may be
valid in terms of the assumptions of parametric statistics, such tests may
not be sensitive enough to detect meaningful differences between groups.
In that regard, Boldovici (1987) elaborated on the fact that findings of
no statistical differences between groups can result from causes other
than the absence of actual differences between means. In examining
"...the adequacy of the research and reporting upon which estimates of
[training] device effectiveness are based" (p. 240), he proposed
inadequate sample size as one reason that results of tank gunnery research
often do not show proficiency differences due to different training
conditions, and recommended that power tests be used to estimate sample
sizes.

Power analyses are not typically reported in gunnery research.
Perhaps these analyses are actually performed but not reported. However,
it is more likely that they have not been performed at all for two
reasons. First, practical power analysis procedures were first introduced
by Cohen (1969) and have only recently filtered down to introductory
statistical textbooks (e.g., Welkowitz, Ewen, & Cohen, 1982; Shavelson,
1988). Researchers are not likely to be as familiar with power analysis
procedures as they are with older, more established statistical
procedures. Second, the detailed gunnery performance data required for
power analyses have not been available to researchers. However, this
situation is also changing with the recent influx of data on Table VIII
live-fire performance and empirical research on U-COFT simulator
performance.
Research Objectives


The ultimate purpose of the present research is to make the
techniques  of  power  analysis  more  accessible  for  the  gunnery  researcher  so 
that  he  can  make  informed  decisions  about  sample  size.  To  accomplish  this 
purpose,  the  research  addressed  the  following  specific  objectives: 

• to present the basic concepts of power analysis in the context of
  gunnery research,

• to compile the Table VIII and U-COFT gunnery performance data that are
  required to perform power analyses,

• to present some examples of how statistical power analyses can be
  used to test the significance of the difference between means of
  two independent groups, and

• to discuss the generality and limitations of the proposed power
  analysis techniques.


ARMOR  GUNNERY  RESEARCH  AND  THE  DETERMINANTS  OF  POWER 


To illustrate some of the fundamental concepts of power analysis,
Figure 1 presents sampling distributions that apply to a significance test
of the difference between means from independent groups. The two curves
represent sampling distributions of the difference between sample means
(M1 - M2) under two assumptions: The left distribution assumes that
H0 is true (i.e., μ1 = μ2) and is therefore centered at zero, whereas the
right assumes that H0 is false and may be centered at any value other than
zero. In the present example, the actual value of μ1 is assumed to be
greater than μ2; thus, the mean of the distribution of differences is
greater than zero. On the abscissa are two values of M1 - M2 (i.e., -c
and +c) that represent critical values of the test statistic required to
reject the null hypothesis: If the obtained difference between sample
means falls between -c and +c, H0 is retained; if the difference falls
outside of either criterion, H0 is rejected. Note that in any given
situation, H0 is either true or false so that only one of the two sampling
distributions actually applies. However, overlapping the distributions
illustrates how the probabilities of outcomes of a statistical test are
interrelated.

Two types of errors can be committed in statistical decision making.
A Type I error is defined as rejecting a true null hypothesis. The
probability of a Type I error is equal to α. In a two-tailed test as
illustrated in Figure 1, α is divided equally between the two tails of the
sampling distribution that assumes H0 is true. A Type II error is defined
as failing to reject a false null hypothesis. The probability of a Type
II error (β) is represented on the distribution that assumes H0 is false
as the area that falls short of (to the left of) +c. In contrast to these
two errors, power is defined as the probability of making a correct
decision, i.e., correctly rejecting a false null hypothesis. As can be
seen in the figure, power is equal to 1 - β, the complement of the
probability of committing a Type II error. In other words, power
represents the sensitivity of a test to detect real differences. Thus, it
is in the researcher's interest to maximize the value of power while
minimizing the values of α and β.
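
The quantities pictured in Figure 1 can be sketched numerically. The
example below is illustrative only: it assumes that both sampling
distributions are normal with the same standard error, and the function
name and input values are hypothetical rather than taken from this report.

```python
from statistics import NormalDist


def two_tailed_outcomes(true_diff, std_error, alpha=0.05):
    """Return (c, beta, power) for a two-tailed test of M1 - M2.

    Assumes normal sampling distributions with a common standard error,
    as drawn in Figure 1 (an assumption of this sketch).
    """
    # Critical difference +c: alpha is split equally between the tails
    # of the distribution centered at zero (H0 true).
    c = NormalDist().inv_cdf(1 - alpha / 2) * std_error
    # Sampling distribution when H0 is false, centered at the true difference.
    alternative = NormalDist(mu=true_diff, sigma=std_error)
    # Beta is the area of that distribution lying between -c and +c.
    beta = alternative.cdf(c) - alternative.cdf(-c)
    return c, beta, 1 - beta


# Hypothetical inputs: a true difference of 1.0 with a standard error of 0.5.
c, beta, power = two_tailed_outcomes(true_diff=1.0, std_error=0.5)
print(f"c = {c:.2f}, beta = {beta:.2f}, power = {power:.2f}")
```

When the true difference is set to zero, the function returns a "power"
equal to α, which reflects the fact that rejecting a true H0 is simply a
Type I error.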

Power is determined by four interrelated factors: (a) the criterion
of significance or α, (b) the size of the sample or N, (c) the variability
of performance measures as indicated by the standard deviation or σ, and
(d) the actual difference between population means or μ1 - μ2. The first
two factors are under the direct control of the experimenter, whereas the
second two are, at most, only indirectly controllable. The extent to
which these factors may be controlled to affect power is discussed below
with regard to standard research practices, practical constraints that
face the gunnery researcher, and available gunnery performance data.


Significance  Criterion 


Value of α. Assuming that H0 is true, the sampling distribution of
the mean difference and α may be specified a priori. Choosing a larger
(less stringent) value for α increases the power of the test. With
reference to Figure 1, increasing α results in decreasing the absolute
values of the test statistic required for significance (|±c|). The
proportion of the right-hand curve beyond the critical value (i.e., 1 - β)
would be thereby increased. However, the price to pay for increasing α
is, by definition, increasing the probability of committing a Type I error
(rejecting a true H0).

Researchers  have  typically  set  a  standard  value  for  the  significance 
criterion at α = .05 (two-tailed), a convention that is usually traced to
Fisher’s  original  (1925)  text  on  analysis  of  variance.  Statistics 
textbook  authors  often  characterize  the  .05  level  as  an  arbitrary 
convention.  In  contrast,  Cowles  and  Davis  (1982)  argued  that  there  are 
historical  precedents  for  this  value  that  predate  Fisher’s  work. 
Furthermore,  these  researchers  cite  their  own  data  on  subjective 
probability  suggesting  that  the  human  attribution  of  cause  (as  opposed  to 
chance)  for  probabilistic  events  occurs  somewhere  between  .10  and  .01,  a 
finding  that  supports  the  .05  convention.  Thus,  the  following  power 
analyses assume the standard .05 value for α for the sake of analytic
conventions,  historical  precedents,  and  agreement  with  human  judgment. 

One- vs. two-tailed tests. Power can also be increased by using a
one-tailed as opposed to the standard two-tailed test. In a one-tailed
test, α is represented at one or the other tail of the sampling
distribution instead of being split between two tails as shown in
Figure 1. The advantage of the one-tailed test is that it effectively
lowers the absolute value of c (thereby increasing power) without
increasing the overall probability of a Type I error. On the other hand,
only under exceptional conditions will a researcher in the behavioral
sciences have enough information to make a directional prediction that is
appropriate for a one-tailed test. Even if he were able to make such a
prediction, a result opposite from that predicted may not be inconsistent
with other theoretical points of view. In fact, results that run counter
to predictions may be the most useful in both a scientific and practical
sense (D. V. Bessemer, personal communication, April 1988). For these and
other reasons, statistical textbook authors (e.g., Kirk, 1984; Glass &
Stanley, 1970) generally try to dissuade students from using the
one-tailed procedure. In keeping with that advice, the power analyses
that follow assume two-tailed tests of significance.
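
The effect of the one-tailed test on the critical value can be illustrated
with a short sketch. It assumes a normally distributed test statistic; the
helper name critical_z is hypothetical and does not come from the sources
cited above.

```python
from statistics import NormalDist


def critical_z(alpha=0.05, tails=2):
    """Critical z when alpha sits in one tail or is split across two."""
    return NormalDist().inv_cdf(1 - alpha / tails)


z_two = critical_z(tails=2)  # alpha/2 = .025 in each tail
z_one = critical_z(tails=1)  # all of alpha = .05 in one tail
print(f"two-tailed: {z_two:.3f}, one-tailed: {z_one:.3f}")
# prints "two-tailed: 1.960, one-tailed: 1.645"
```

The smaller one-tailed value corresponds to the lower absolute value of c
described in the text, obtained at the same overall α.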


Sample  Size 


As implied by the central limit theorem, an increase in sample size
reduces the variance of the sampling distribution. With reduced variance,
the test statistic values fall, on average, closer to the mean value.
Therefore, assuming a constant value for α, reduction of the sampling
variance results in a lower absolute value of the test statistic required
for significance (i.e., |c|). In reference to Figure 1, lowering this
value would increase the proportion of the right-hand curve (H0 false)
that is beyond the critical value. Thus, increasing N increases power and
reduces the probability of a Type II error without a necessary increase in
α.
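
The relation between N and the critical value can be written out as a
worked equation. The sketch below assumes two groups of equal size N, a
common population standard deviation σ, and a normal approximation to the
sampling distribution; a critical value based on the t distribution would
be somewhat larger for small N.

$$\sigma_{M_1 - M_2} = \sigma\sqrt{\frac{2}{N}}, \qquad c = z_{1-\alpha/2}\,\sigma\sqrt{\frac{2}{N}}$$

With α = .05 (two-tailed), $z_{1-\alpha/2} = 1.96$, so under these
assumptions quadrupling N halves the critical difference c.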


Firing the tank under normal conditions requires the coordinated
efforts of four crewmen. Thus, the sampling unit is the tank crew rather
than the individual soldier. The number of crews available for research
is often constrained for practical or logistic reasons. One important
constraint is that crews are frequently assigned to research projects as
intact units (i.e., companies, battalions, or brigades). Assuming equal
sample sizes, the resulting comparisons are between experimental
groups that are equal in size to these units, or some fraction thereof.^
Therefore, it is useful to consider the standard Army armor units that may
apply to research projects. Note that higher echelon units (division,
corps, etc.) are not considered in the following discussion, because they
have variable numbers of elements and are considered unrealistically large
as individual samples.


^The  fact  that  crews  are  assigned  to  experiments  as  intact  units  does  not 
imply  that  all  crews  within  a  unit  should  be  assigned  to  the  same 
experimental  condition  within  the  experiment.  Assignment  of  intact  units 
to  experimental  conditions  confounds  between-unit  differences  with 
treatment  effects.  In  addition,  within-group  variability  estimates  for 
intact  groups  underestimate  the  variability  inherent  in  the  population 
because  they  exclude  between-unit  differences  (D.  V.  Bessemer,  personal 
communication,  April  1988).  The  researcher  should  instead  randomly 
assign  crews  to  experimental  conditions  regardless  of  their  unit  membership. 
Platoon. The smallest armor unit is the platoon, which consists of
four tanks. For most measures, a sample size of four is too small to
estimate population parameters because of the exceptionally large
variability of the sampling distribution. Also, the sampling
distributions of small samples are poorly fit by the normal distribution.
Consequently, traditional parametric statistical techniques are
inappropriate for platoon-sized samples.

Company. The next larger unit is the company, which consists of three
platoons having four tanks per platoon plus two additional tanks for the
company commander and his executive officer. The total number of crews
available from a company (14) represents perhaps the minimum acceptable
sample size. Note, however, that with the normal attrition that occurs in
gunnery research (e.g., crews not showing up, equipment breaking down,
etc.), the actual number of available crews from one company may be
unacceptably small for parametric analysis.

Battalion.  The  next  larger  unit  is  the  battalion,  which  consists  of 
four  companies  having  14  tanks  per  company  plus  two  additional  tanks  for 
the  battalion  commander  and  his  executive  officer.  With  its  58  total 
crews,  the  battalion  would  provide  enough  crews  to  fulfill  the  most 
rigorous  requirement  of  parametric  statistics  even  with  substantial 
attrition. 

Brigade.  The  largest  unit  under  consideration  is  the  close  combat 
heavy  brigade.  According  to  doctrine,  this  type  of  brigade  consists  of 
two  armor  battalions  and  one  mechanized  infantry  battalion.^  No  tanks  are 
assigned  to  the  mechanized  infantry  battalion  nor  are  there  tanks  assigned 
to  brigade  headquarters  and  headquarters  company.  The  resulting  sample  of 
116  crews  provides  the  upper  limit  of  sample  sizes  under  consideration  and 
is  only  rarely  achieved  in  gunnery  research. 
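
The crew counts quoted for each echelon follow from simple arithmetic on
the unit structure described above. The short sketch below reproduces
them; the variable names are illustrative.

```python
# Crews (tanks) per unit, built up from the doctrinal structure in the text.
platoon = 4                      # four tanks per platoon
company = 3 * platoon + 2        # three platoons plus commander and XO tanks
battalion = 4 * company + 2      # four companies plus commander and XO tanks
brigade = 2 * battalion          # two armor battalions; no other tanks assigned

for name, crews in [("platoon", platoon), ("company", company),
                    ("battalion", battalion), ("brigade", brigade)]:
    print(f"{name:<10}{crews:>4}")
```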


Variability  of  Performance  Measures 

Reducing the variability of performance measures affects sampling
distributions in the same manner as does increasing sample size: The
variability of the sampling distributions is reduced. Therefore,
decreasing the sampling variance has the same effect on power as does
increasing sample size: It increases power and reduces β without an
increase in α.

The researcher has only limited control over the variability of
performance measures. For instance, he can minimize the impact of
external sources of variability such as those related to differences in
test administration and scoring. In contrast, the experimenter cannot
control internal sources of variability caused by inherent differences
both within and among crews. For the purposes of power analysis, however,
he need only estimate the magnitude of these internal sources of
variability. Table 1 presents standard deviation point estimates along
with the corresponding 95% confidence intervals for a number of gunnery
performance measures obtained from Table VIII and U-COFT data. Appendix A
presents formal definitions of each performance measure in Table 1. The
next sections describe how the estimates were obtained, and the final
section compares the results from the two performance measurement media.


^Actual brigades often deviate from this doctrinal definition as required
by their stated mission.

Table VIII. The Office, Chief of Armor (OCOA) maintains a detailed
data base on gunnery performance on Table VIII at Grafenwöhr. This data
base is implemented on an IBM mainframe computer and updated periodically.
Recent data from 872 M1 crews who underwent qualification sometime in the
interval from November 1986 to June 1987 were transferred to an
MS-DOS-based floppy diskette and analyzed using statistical analysis
software for personal computers.^ Data on four performance measures were
analyzed:


Table 1

Standard Deviation Point Estimates and 95% Confidence Intervals
for Gunnery Performance Measures

                                       Measurement Medium
                                 Table VIII            U-COFT
Measure            Units         SD     CI          SD     CI
------------------------------------------------------------------
Target ID Time     Seconds       --     --          1.6    0.8-2.9
Opening Time       Seconds       1.7    1.6-1.7     2.0    1.2-3.4
1st Round Hits     Percent       13     12-14       14     6-33
Hits               Percent       12     11-13       11     6-18
Elevation Error    Mils          --     --          0.15   0.06-0.41
Azimuth Error      Mils          --     --          0.34   0.09-1.25
Aiming Error       "Distance"    --     --          0.28   0.15-0.52
Table VIII Score   "Points"      98     93-103      --     --

^The data base itself was provided by Al Pomey of the U.S. Army Armor and
Engineer  Board.  Standard  deviation  values  were  obtained  from  Hoffman 
(1988)  who  described  other  attributes  of  the  performance  measurement 
distributions  as  well.  I  thank  both  for  their  cooperation  in  obtaining 
the  data  required  for  the  power  analysis. 
opening time, percent first-round hit, percent hit, and Table VIII score.
The  standard  deviation  estimates  were  based  on  the  average  performance  by 
individual  crews  across  the  ten  engagements  on  Table  VIII.  Confidence 
intervals  for  each  standard  deviation  estimate  were  calculated  using  the 
chi-square  distribution  (Kirk,  1984).  The  point  estimates  and  confidence 
intervals  for  the  Table  VIII  data  are  presented  in  the  first  two  columns 
of  Table  1. 
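
The chi-square interval for a standard deviation can be sketched as
follows. Several assumptions are made here that the report does not state:
that all 872 crews contribute to each estimate, that the interval takes
the usual form based on (n - 1)s²/χ², and that a Wilson-Hilferty
approximation to the chi-square quantile is acceptable in place of exact
tables (a routine such as scipy.stats.chi2.ppf could be substituted). The
function names are hypothetical.

```python
import math
from statistics import NormalDist


def chi2_quantile(p, df):
    """Wilson-Hilferty approximation to the chi-square quantile."""
    z = NormalDist().inv_cdf(p)
    h = 2.0 / (9.0 * df)
    return df * (1.0 - h + z * math.sqrt(h)) ** 3


def sd_confidence_interval(sd, n, level=0.95):
    """Confidence interval for a population SD from a sample SD and n."""
    df = n - 1
    tail = (1.0 - level) / 2.0
    # The upper chi-square quantile gives the lower SD limit, and vice versa.
    lower = sd * math.sqrt(df / chi2_quantile(1.0 - tail, df))
    upper = sd * math.sqrt(df / chi2_quantile(tail, df))
    return lower, upper


# SD of percent first-round hits from Table 1; n = 872 is an assumption.
low, high = sd_confidence_interval(sd=13.0, n=872)
print(f"95% CI for SD: {low:.1f} to {high:.1f}")
```

Rounded to whole percentage points, the limits from this sketch agree with
the 12-14 interval shown in Table 1 for first-round hits.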

U-COFT. Standard deviation estimates of U-COFT gunnery performance
were based on published research performed at the ARI Armor R&D Activity
at Fort Knox. Appendix Table B summarizes this literature in tabular
form. In contrast to Table VIII, U-COFT performance tests are not
standardized; instead, they are customized in content and length to fit
the purposes and constraints of particular experiments. Note that some of
the summary data are based on only a few data points. Sample point
estimates of standard deviations were calculated for measures for which
there were at least seven data points. The variances of the six measures
(of the total thirteen) that met this criterion were transformed
logarithmically to approximate a normal distribution. The transformed
variances were then treated as means to calculate a single point estimate
and confidence interval for each measure (Box, Hunter, & Hunter, 1978).
Point estimates and confidence intervals were based on means and standard
deviations of the transformed variances and were weighted by sample size.
These variance estimates were then retransformed by antilogarithm and
converted to standard deviation values. The results are presented in the
second column of Table 1.
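
The log-variance procedure can be sketched in code. The report does not
give the exact weighting or interval formulas, so the version below is
only one plausible reading: a sample-size-weighted mean of the log
variances, a weighted standard deviation about that mean, and a t-based
interval on the log scale. The input data are hypothetical, and the
function name pool_sd is illustrative.

```python
import math


def pool_sd(variances, sizes, t_crit):
    """Pooled SD point estimate and confidence limits via log variances.

    variances: sample variances from separate studies (hypothetical here)
    sizes:     sample sizes, used as weights
    t_crit:    two-tailed critical t for len(variances) - 1 df, from a table
    """
    k = len(variances)
    logs = [math.log(v) for v in variances]
    total = sum(sizes)
    mean_log = sum(w * x for w, x in zip(sizes, logs)) / total
    # Weighted variance of the log variances, with a k/(k-1) correction.
    var_log = sum(w * (x - mean_log) ** 2 for w, x in zip(sizes, logs)) / total
    var_log *= k / (k - 1)
    half_width = t_crit * math.sqrt(var_log / k)
    bounds = (mean_log, mean_log - half_width, mean_log + half_width)
    # The antilog returns a variance; the square root converts it to an SD.
    return tuple(math.sqrt(math.exp(b)) for b in bounds)


variances = [3.2, 4.8, 2.9, 5.5, 4.1, 3.6, 6.0]  # hypothetical values
sizes = [12, 16, 10, 24, 14, 12, 20]             # hypothetical values
sd, low, high = pool_sd(variances, sizes, t_crit=2.447)  # t(.975), 6 df
print(f"SD = {sd:.2f}, 95% CI = {low:.2f} to {high:.2f}")
```

Because the interval is symmetric on the log scale, the retransformed
limits are asymmetric about the point estimate, as are the U-COFT
intervals in Table 1.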

Summary of results and comparisons across media. For the three
performance measures that are common to both measurement media (opening
time, 1st round hits, and hits), confidence intervals of the standard
deviation estimates from the Table VIII data were much smaller than those
from the U-COFT data. This result was expected given that the Table VIII
data were based on more crews and were obtained under standardized testing
conditions. Despite the difference in the stability of the two sets of
estimates, the point estimates of the standard deviations for
corresponding performance measures are nevertheless remarkably close in
absolute values. For the two accuracy measures (first round hits and
hits), U-COFT estimates of the standard deviations were within the 95%
confidence interval of the Table VIII estimate, indicating that standard
deviation estimates from U-COFT data were not unlikely estimates of
Table VIII standard deviations. Despite a small absolute difference
between the standard deviation estimates for the third measure (opening
time), the standard deviation estimate calculated from the U-COFT data
fell above the upper limit of the Table VIII confidence interval. This
greater variability in opening times may be due to the difficulty in
acquiring targets on U-COFT that has been reported by Graham (1986) and
others. Nevertheless, the standard deviation estimate of opening times
from the Table VIII data fell well within the confidence interval for the
U-COFT data, indicating that the lower Table VIII value is not an unlikely
standard deviation estimate for the U-COFT data.
The similarity in standard deviation estimates was unexpected given
the problems associated with measuring live-fire gunnery performance
(e.g., Powers, McCluskey, Haggard, Boycan, & Steinheiser, 1975; Fingerman,
1978). That is, Table VIII performance was expected to be more variable
than U-COFT performance due to the greater influence of external sources
of variability. Two sets of factors may be responsible for the
similarities in the standard deviation estimates. First, the U-COFT is
designed to closely model tank weapon effects, including some of the
external sources of variability such as round-to-round dispersion effects.
Second, the Grafenwöhr data were collected under relatively standardized
conditions. This practice reduces external variability due to differences
in test administration.


Difference  Between  Means 


Power is directly related to the actual difference that exists
between population means; as this difference increases, so does power.
With reference to Figure 1, an increase in the mean difference would be
represented by increasing the distance between the two sampling
distributions. Assuming constant α, the effect of increasing the
difference between means would then be to increase the proportion of the
right-hand distribution beyond the critical value. Thus, increasing the
difference between means increases power and decreases β without
affecting α.

The difference between means is an inherent quality of the treatment
itself and is controllable by the experimenter only within limits. In
general, the experimenter should ensure that the treatment effect is as
large as possible so that the power of the comparisons is sufficiently
large. For example, a one-hour exposure to an experimental training
program may not produce a sufficient mean difference when compared to a
no-treatment control; two- or three-hour exposures may be needed. This is
not to say that less extreme values of the independent variable should not
be compared to test for the possibility of a nonmonotonic effect. In other
cases, comparisons of the most extreme values of the independent variable
may not make sense.^ Nevertheless, the researcher should be aware of the
implications of this factor for research design.

In terms of power analyses, the researcher must be able to provide an
estimate of the true difference between means. Clearly this value is not
known a priori; if it were, the test of significance would be pointless.
On the other hand, the experimenter might be able to determine what this
value should be. In other words, the researcher can establish a minimal
difference that he thinks is meaningful both to him and to the consumers
of his research. Once this value is determined, the procedure described
in the next section can be used to ensure that his test is capable of
detecting such a difference with predetermined power if he provides the
required number of subjects.


^I thank D. V. Bessemer (personal communication, April 1988) for pointing
out the advantages of comparing differences among the less extreme values
of the independent variable.


POWER  ANALYSIS  METHODS 


A number of simple methods for power analysis have been developed
recently (e.g., Friedman, 1982; Shavelson, 1988; Welkowitz, Ewen, & Cohen,
1982), each algebraically equivalent to the others. However, the procedure
outlined by Welkowitz et al. (1982) is notable for its simplicity and
clarity. A central concept in their technique is effect size (γ), which
effectively combines two determinants of power: the true difference
between means and the variability of performance measures, or

$$\gamma = \frac{\mu_1 - \mu_2}{\sigma} \qquad (1)$$

Welkowitz et al. (1982) use the concept of effect size to partition power
analysis into three components: γ, N, and power. Specification of any
two of these components necessarily determines the third.

The following subsections describe two general power analysis problems
that are discussed by Welkowitz et al. (1982) and a third approach that
is specifically tailored to armor applications. In each problem, it is
assumed that the means from two independent groups are being compared.
The generalization of this design to other, more complicated designs is
discussed in the final section. Sample size (N) refers to the number of
crews assigned to each group. Assuming equal sample sizes (i.e.,
$N = N_1 = N_2$), the total number of crews assigned to such an
experiment would be equal to 2N. If the samples are not equal, the value
of N is calculated as the harmonic mean of the two sample sizes, or
$2N_1N_2/(N_1 + N_2)$.
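The unequal-sample rule can be illustrated with a short Python sketch.
The function name and the group sizes of 14 and 12 crews are hypothetical
and chosen only for illustration; they are not taken from the data in
this report.

```python
def effective_n(n1: int, n2: int) -> float:
    """Per-group N for unequal groups: harmonic mean 2*N1*N2 / (N1 + N2)."""
    if n1 <= 0 or n2 <= 0:
        raise ValueError("group sizes must be positive")
    return 2.0 * n1 * n2 / (n1 + n2)


# Hypothetical case: one full company (14 crews) versus a reduced one (12).
print(round(effective_n(14, 12), 2))

# With equal groups the harmonic mean is simply the common group size.
print(effective_n(14, 14))
```

Because the harmonic mean is pulled toward the smaller group, adding
crews to only one group yields diminishing returns in power.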


Two  General  Power  Analysis  Problems 

Welkowitz et al. (1982) describe two types of power analysis problems
that may interest the gunnery researcher: power determination and sample
size determination. Each of these is described below along with examples
of armor gunnery performance problems.

Power determination. The power of a test can be calculated either
before or after the fact, provided the researcher has the following data:
(a) an estimate of the actual difference between means, (b) an estimate of
the standard deviation of performance measures, and (c) a proposed or
actual sample size. To calculate power, the researcher must first combine
the mean difference and the standard deviation into an effect size measure
using Formula 1. The values of $\gamma$ and N are then used to calculate
$\delta$ (delta) as follows:

$$\delta = \gamma (N/2)^{1/2} \qquad (2)$$

Power is a direct function of $\delta$ and can be obtained from Appendix
Table C-1 (from Welkowitz et al., 1982).

As an example problem, suppose a researcher suspects that a new
training program would, at most, decrease average opening time on U-COFT
by about one second. Furthermore, he knows that he can obtain only two
companies of crews for his research. Thus, comparisons between the two
companies would be based on a sample size of 14. With this information,
he can calculate the probability of detecting a true difference before
performing the research. First, calculating effect size from Formula 1,
we have $\gamma = 1.0/2.1 = 0.48$. Substituting the values for gamma and
sample size in Formula 2, we obtain $\delta = 0.48(14/2)^{1/2} = 1.27$.
Assuming a two-tailed test and $\alpha = .05$, the expected power of the
test would be about .26 (value from Table C-1). In other words, the
researcher would be able to reject the null hypothesis in about one out of
four experiments given this actual difference. Because of the low power,
the researcher should consider changing the design of his experiment
either to increase the effect of training or to increase sample size.
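The power determination steps can be sketched in Python as follows. This
is a minimal sketch that assumes the tabled power values can be
approximated with the standard normal distribution; the function names
are hypothetical, and the approximation is an assumption made here, not a
procedure stated in the report.

```python
from math import sqrt
from statistics import NormalDist


def effect_size(mean_diff: float, sd: float) -> float:
    """Formula 1: difference between means divided by the standard deviation."""
    return mean_diff / sd


def delta(gamma: float, n: float) -> float:
    """Formula 2: delta = gamma * (N / 2) ** 0.5, with N crews per group."""
    return gamma * sqrt(n / 2.0)


def power_two_tailed(delta_value: float, alpha: float = 0.05) -> float:
    """Approximate power of a two-tailed test (normal approximation, assumed)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1.0 - alpha / 2.0)
    # Probability of rejecting in either tail when the true shift is delta.
    return z.cdf(delta_value - z_crit) + z.cdf(-delta_value - z_crit)


# Worked example from the text: 1.0-second difference, SD 2.1, N = 14.
gamma = effect_size(1.0, 2.1)
d = delta(gamma, 14)
print(round(gamma, 2), round(d, 2), round(power_two_tailed(d), 2))
```

Because the sketch carries unrounded intermediate values, its results
fall slightly below the hand-calculated 1.27 and the tabled .26, but the
conclusion of roughly one rejection in four experiments is unchanged.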

Sample size determination. An appropriate sample size may be
determined if the experimenter knows (a) the desired power level, (b) the
standard deviation of the performance measure, and (c) the actual
difference between means. Manipulating Formula 2 to solve for N produces
the following equation:

$$N = 2(\delta/\gamma)^2 \qquad (3)$$

To continue the previous example, the researcher may conclude that
the easiest way to increase the power of his statistical test is to
increase his sample size. To determine an appropriate sample size, he
must first decide on an 'acceptable' value for power. For the sake of
argument, assume that the experimenter considers a test sufficiently
powerful if it correctly rejects the null hypothesis in two out of three
cases, i.e., if power is at least .67. From Appendix Table C-2, we see
that the $\delta$ value corresponding to a power level of .67 is 2.39.
Substituting this value into Formula 3, we obtain
$N = 2(2.39/.48)^2 = 49.6$. In other words, the study would require
sample sizes of at least 50 crews, or 100 crews in all. In terms of unit
constraints, this result implies that each sample should consist of about
one battalion's complement of crews (i.e., N = 58).
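The sample size calculation above can be reproduced with a few lines of
Python. The function name is hypothetical; the inputs are the values used
in the worked example (a tabled value of 2.39 and an effect size of .48).

```python
from math import ceil


def sample_size_per_group(delta_value: float, gamma: float) -> float:
    """Formula 3: N = 2 * (delta / gamma) ** 2, the number of crews per group."""
    if gamma <= 0:
        raise ValueError("effect size must be positive")
    return 2.0 * (delta_value / gamma) ** 2


n = sample_size_per_group(2.39, 0.48)
# Round up, because a fractional crew cannot be assigned to a group.
print(round(n, 1), ceil(n))  # prints: 49.6 50
```

Note that N varies with the inverse square of the effect size, so halving
the expected difference quadruples the required number of crews.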


Determination  of  Minimum  Detectable  Difference 


A third technique of power analysis is added to the two previously
described techniques. This method capitalizes on the fact that some of
the values of power determinants are either known or constrained in
gunnery research. This third power analysis technique may be termed
determination of the minimum detectable difference (MDD) between means.
The MDD is the smallest actual difference between means
($\mu_1 - \mu_2$) that can be determined to be significant given values
for (a) sample size, (b) the standard deviation of performance measures,
and (c) the desired power level. To obtain a formula for the minimum
detectable difference, either Formula 2 or 3 may be solved for $\gamma$,
resulting in

$$\gamma = \delta (2/N)^{1/2} \qquad (4)$$

Then, substituting Formula 1 for $\gamma$ and solving for
$\mu_1 - \mu_2$, the resulting formula for MDD is

$$\mu_1 - \mu_2 = \sigma \delta (2/N)^{1/2}. \qquad (5)$$

The first parameter for this analysis (sample size) is constrained to
a few likely values, i.e., the numbers of crews in companies, battalions,
and brigades. The second parameter (standard deviation of performance
measures) may be obtained from empirical data sources. The third
parameter (desired power level) may be set according to the following
statistical convention. Researchers typically regard the consequences of
a Type II error (failing to detect an actual difference) as less serious
than the consequences of a Type I error (detecting a difference that is
not real). Some have suggested that a ratio of 4 to 1 (Type II to Type I
error probabilities) is an acceptable relationship between the two types
of error (Kirk, 1984). Using this reasoning, the .05 level for $\alpha$
implies that .20 is an acceptable value for $\beta$. Because power is the
complement of $\beta$ (i.e., $1 - \beta$), the commonly accepted value
for power is then .80. In terms of Equation 5, power of .80 implies a
$\delta$ equal to 2.8 (from Appendix Table C-2).
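For readers without the appendix tables at hand, the tabled value can be
approximated in Python. This sketch assumes a normal approximation in
which the required value is the sum of the critical value for the
significance criterion and the normal quantile of the desired power. That
assumption is made here for illustration and is not stated in the report,
although it gives values close to those quoted in the text. The function
name is hypothetical.

```python
from statistics import NormalDist


def delta_for_power(power: float, alpha: float = 0.05,
                    two_tailed: bool = True) -> float:
    """Approximate delta needed for a given power (normal approximation)."""
    z = NormalDist()
    tail = alpha / 2.0 if two_tailed else alpha
    # Critical value of the test plus the quantile for the desired power.
    return z.inv_cdf(1.0 - tail) + z.inv_cdf(power)


# Conventional power of .80 with a two-tailed test at the .05 level.
print(round(delta_for_power(0.80), 2))

# Power of two out of three, as in the sample size example.
print(round(delta_for_power(2.0 / 3.0), 2))
```

The first call gives a value close to the 2.8 quoted above, and the
second is close to the 2.39 used in the sample size example.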

Given that all the parameters of Equation 5 may be specified, a table
of minimum differences for each performance measure may be generated
assuming a two-tailed test and $\alpha = .05$. Table 2 shows MDD values
for each of the two measurement media: Table VIII and U-COFT. Comparing
across measurement media, the corresponding values for MDD are similar
owing to the nearly equivalent standard deviation values shown in the
previous table. Perhaps the most notable generalization that may be drawn
from this analysis is that tests comparing company-sized samples are
relatively insensitive to differences between means. In order to be
detected by a statistical test in 8 out of 10 cases, actual mean
differences from company-sized samples (N = 14) must be on the order of
2 seconds in opening time, over 12% in hit probability, and over 100
points in Table VIII score. Furthermore, Hoffman (1988) showed that
average performance on these measures for the Table VIII data is already
near the limit of performance.^ Ceiling and floor effects make it
extremely unlikely that treatments can improve average performance enough
to be detected by statistical comparisons of two groups. The conclusion
drawn from these data is that, whereas company-sized samples may be
sufficient to fulfill the requirements of parametric statistics, they are
insufficient to detect all but the most drastic differences in gunnery
performance.


^His Table VIII data indicate that crews average 2.1 seconds in opening
time, 81% hits, and 845 Table VIII points (out of a possible 1000).


Table 2

Minimum Detectable Differences for Gunnery Performance Measures Obtained
on U-COFT or on Table VIII Assuming $\alpha = .05$ and Power
(i.e., $1 - \beta$) $= .80$

                                                         Medium
Performance Measure   Sample Size^  Units         Table VIII   U-COFT

Target ID Time        Company       Seconds           --         1.7
                      Battalion                       --         0.8
                      Brigade                         --         0.6

Opening Time          Company       Seconds           1.8        2.1
                      Battalion                       0.9        1.0
                      Brigade                         0.6        0.7

First Round Hits      Company       Percent           14         15
                      Battalion                       7          7
                      Brigade                         5          5

Hits                  Company       Percent           13         12
                      Battalion                       6          6
                      Brigade                         4          4

Elevation Error       Company       Mils              --         .16
                      Battalion                       --         .08
                      Brigade                         --         .06

Azimuth Error         Company       Mils              --         .36
                      Battalion                       --         .18
                      Brigade                         --         .13

Aiming Error          Company       'Distance'        --         .30
                      Battalion                       --         .15
                      Brigade                         --         .10

Table VIII Score      Company       'Points'          104        --
                      Battalion                       51         --
                      Brigade                         36         --

^Sample sizes are 14, 58, and 116 for company, battalion, and brigade,
respectively.

The advantage of using this table of minimum detectable differences
is that the researcher does not have to determine a difference between
means a priori. He can instead propose a performance measure and sample
size and see if the value of the MDD is 'reasonable' for his needs. In
other words, this analysis requires that the researcher confirm a
difference value from a table rather than estimate such a value. This
table also permits the researcher to make tradeoffs between sample size
and detectable difference. Nevertheless, the table of MDDs should not be
regarded as a table of immutable values. The table should instead be
regarded as a best guess at the relationship between the two factors.
Other specific limitations of this approach to power analysis are
discussed in the next section.


LIMITATIONS  OF  THE  METHODS 


Although these techniques should help the gunnery researcher to make
more systematic decisions about sample size, there are situations that may
invalidate (or at least limit) the interpretation of the results from
the present approaches. For instance, these techniques apply to
statistical hypotheses about means and not to hypotheses about other
attributes of performance distributions (e.g., the correlation
coefficient, r). However, analogous procedures could easily be developed
for such attributes. Other, less obvious boundary conditions and their
effects on power are discussed in the following paragraphs.


Sample  Sizes  Other  Than  Those  Specified 

Although sample sizes are constrained by the organization of armor
units, sample sizes other than 14, 58, and 116 are not only possible but
likely under some circumstances. For instance, the tanks of the company
and battalion commanders and their executive officers may not be
available to the researcher. Under that assumption, one would obtain unit
sizes of 12, 48, and 96 for company, battalion, and brigade,
respectively. The MDDs for these sample sizes should be slightly larger
than the tabled values for corresponding units. For instance, tabled
values for the hits measure on the U-COFT are 12, 6, and 4 for company,
battalion, and brigade, respectively. Assuming that commander and
executive officer tanks are not available, the values of MDD recalculated
from Formula 5 would be 13, 6, and 4, which is not much different. As a
second example, different sample sizes could be obtained by
concatenating or dividing units. For instance, one could design an
experiment that divides a battalion into two groups, each group
consisting of two companies' worth of tanks, i.e., N = 28. In either
case, one could determine the MDD for these particular sample sizes by
using Formula 5. For instance, assuming one would want to use samples
consisting of two companies, the MDD for percent hit would be
$(11)(2.8)(2/28)^{1/2} = 8.2$. An even simpler procedure is to recognize
that a sample size of two companies falls about halfway between one
company and one battalion. Thus, one would estimate that the MDD also
falls about midway between the two tabled values, or (12 + 6)/2 = 9,
again not far from the actual calculated value. Thus, the table provides
enough data points so that the MDDs of sample sizes not listed in the
table may be estimated or interpolated.
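The recalculations described in this subsection can be sketched in
Python. The function name is hypothetical; the standard deviation of 11
percentage points and the value of 2.8 are those used in the worked
example above for percent hits on the U-COFT.

```python
from math import sqrt


def mdd(sd: float, n: int, delta_value: float = 2.8) -> float:
    """Formula 5: minimum detectable difference = sd * delta * (2 / N) ** 0.5."""
    if n <= 0:
        raise ValueError("sample size must be positive")
    return sd * delta_value * sqrt(2.0 / n)


SD_HITS = 11.0  # percent hits, from the worked example in the text

# Full-strength units, reduced units, and a two-company sample.
for label, n in [("company", 14), ("battalion", 58), ("brigade", 116),
                 ("reduced company", 12), ("reduced battalion", 48),
                 ("reduced brigade", 96), ("two companies", 28)]:
    print(f"{label:>17}  N = {n:3d}  MDD = {mdd(SD_HITS, n):4.1f}")
```

Rounded to whole percentage points, the first six rows agree with the
values of 12, 6, and 4 and of 13, 6, and 4 cited above, and the last row
gives the 8.2 obtained for a two-company sample.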


Comparisons  Among  More  Than  Two  Groups 

Strictly speaking, the present techniques do not apply to experimental
designs that compare more than two groups, i.e., those requiring
one-way analysis of variance (ANOVA) techniques. Hinkle and Oliver (1983)
provided methods for power analysis and sample size determination for
comparisons of more than two groups. These researchers also demonstrated
by example how the technique can be extended to determining the sample
size requirements for the main effects of higher order designs. They
acknowledged, however, that determining the sample size for testing
interaction effects would be much more complex. A more serious problem
with this technique is that it is based on the difference between the
means of the two most extreme groups and assumes that the remaining groups
do not differ from the grand mean. If the intervening group means take
values other than the grand mean, the estimate of the between-groups mean
square will necessarily be larger (Kirk, 1968). As a consequence, power
estimates for more than two groups tend to underestimate actual values,
and sample size estimates overestimate actual requirements.

In many research projects entailing more than two groups, the
analysis nevertheless focuses on comparisons between two means at a time.
If the researcher were able to specify a meaningful set of orthogonal
comparisons between means a priori, the procedures described herein should
apply for each comparison. The rationale for this assertion is that the
$\alpha$ for each orthogonal comparison is equal to the stated
experiment-wise significance criterion. That is, the stated relationships
among $\alpha$, $\beta$, N, and MDD as given in Table 2 should apply to
each orthogonal comparison. If the comparisons are not orthogonal, the
probability of a Type I error for each comparison is greater than
$\alpha$, the experiment-wise error rate. Thus, the sample size estimates
would tend to underestimate sample requirements for nonorthogonal
comparisons. [For an extensive discussion of different approaches to
correcting the error rate for nonorthogonal comparisons, see Kirk (1968).]


Within-Crew  Designs 

An alternative to assigning independent samples of crews to
treatments is to assign a single sample to all experimental treatments.
Such 'within-crew' designs are more powerful than between-group designs
because differences between crews can be isolated and 'removed' as a
source of error. The residual standard deviation may be calculated as

$$\sigma_{res} = \sigma (1 - r^2)^{1/2} \qquad (6)$$

As can be seen from Formula 6, the size of the residual standard deviation
is dependent on the correlation between repeated measures across subjects:
As the correlation increases, the residual standard deviation decreases
and the overall treatment effect (Formula 1) increases. [Estimates of the
correlations between repeated measures of U-COFT performance are provided
by Graham (1986) and DuBois (1987).] Thus, given a nonzero correlation
between repeated measures, the within-crew design is more likely to detect
a real difference between treatments than an independent groups design.
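The effect of the correlation on the residual standard deviation can be
illustrated with a short Python sketch. It assumes that Formula 6 has the
form of the standard deviation multiplied by the square root of one minus
the squared correlation, and that the residual value then replaces the
standard deviation in Formula 1. The function names and the correlation
values are hypothetical; the 1.0-second difference and the standard
deviation of 2.1 are borrowed from the earlier opening time example.

```python
from math import sqrt


def residual_sd(sd: float, r: float) -> float:
    """Residual SD for repeated measures: sd * (1 - r ** 2) ** 0.5 (assumed form)."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("correlation must lie between -1 and 1")
    return sd * sqrt(1.0 - r * r)


def effect_size(mean_diff: float, sd: float) -> float:
    """Formula 1: difference between means divided by the standard deviation."""
    return mean_diff / sd


# Hypothetical correlations between repeated measures.
for r in (0.0, 0.3, 0.5, 0.7):
    s_res = residual_sd(2.1, r)
    print(f"r = {r:.1f}  residual SD = {s_res:.2f}  "
          f"effect size = {effect_size(1.0, s_res):.2f}")
```

With a correlation of zero the residual standard deviation equals the
original value, so the within-crew design offers no gain in effect size;
the gain grows as the correlation between repeated measures rises.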

The problem with within-crew comparisons is the existence of
carry-over effects between treatments. If the focus of the research is on
carry-over effects per se, then the within-crew design is appropriate.
For instance, a within-crew design would be appropriate if one wished to
study the changes in performance as a function of skill acquisition or
retention. If the research does not focus on carry-over effects, the
experiment may be designed to counterbalance and actually evaluate these
carry-over effects. Although such within-subject designs are potentially
more powerful, they have the following drawbacks: (a) they require more
complex management of the research effort to ensure that each subject gets
the proper sequence of conditions, (b) they usually require more lengthy
participation by each subject, and (c) they sometimes require more
subjects to fill out all the sequences of conditions (D. W. Bessemer,
personal communication, April 1988). Finally, even counterbalancing
cannot be used to compensate for independent variables whose effects are
not transitory. Consider, for instance, the case where an experimenter
wishes to compare the effects of two training techniques. If each crew
were trained using both techniques, the effect of the treatments
themselves would be irrevocably confounded with unknown facilitative
and/or interfering effects between treatments. Thus, whereas the
within-crew design provides a more powerful approach to testing mean
differences, the design is appropriate only for a limited subset of
independent variables.


Accuracy  of  Variability  Estimates 

Finally, the validity of the power analysis methods discussed in this
report depends on the accuracy of variability estimates. The Table VIII
data were based on a substantial number of crews, and the estimates appear
to be reasonably stable. Furthermore, now that Table VIII data collection
is automated, these data can and should be updated from time to time. The
U-COFT performance data were more problematic in that performance measures
were based on fewer crews and collected under varying conditions. In
order to increase the stability of the existing estimates and to add
variability estimates of new performance measures, more U-COFT variability
data will be needed. This assertion only reemphasizes the importance of
researchers' continuing to report estimates of performance variability
along with estimates of central tendency.


APPENDIX A


DEFINITIONS  OF  PERFORMANCE  MEASURES 


Performance Measure   Definition

Target ID time        Time (in seconds) from when the target first
                      appears to when the gunner responds 'identified'
                      to the tank commander's fire command.

Opening time          Time (in seconds) from when the target first
                      appears to when the gunner fires the first round
                      at the first target.

First-round hits      Percentage of total engagements wherein the
                      gunner hits the target with the first round
                      fired.

Hits                  Total number of hits divided by total rounds
                      fired, expressed as a percentage.

Elevation error       Deviation in elevation of the reticle cross hairs
                      from the correct aiming point, expressed in mils.

Azimuth error         Deviation in azimuth of the reticle cross hairs
                      from the correct aiming point, expressed in mils.

Aiming error          Conversion of elevation and azimuth error from
                      angular measure to a single 'distance' measure
                      calculated as
                      $(\text{elevation error}^2 + \text{azimuth error}^2)^{1/2}$.

Table VIII score      A composite score based on performance on 10
                      different Table VIII engagements. On each
                      engagement, a crew can receive a maximum of 100
                      points according to the number of targets hit and
                      the time required to hit the targets. The Table
                      VIII score is calculated by summing over all ten
                      engagement scores. Procedural errors (e.g.,
                      improper fire command) can reduce this overall
                      score.


STANDARD DEVIATION ESTIMATES FROM U-COFT RESEARCH


APPENDIX C


POWER  ANALYSIS  TABLES  FROM  WELKOWITZ  ET  AL. 


Table C-1

Power as a Function of $\delta$ and Significance Criterion ($\alpha$)

          One-tailed test ($\alpha$)              One-tailed test ($\alpha$)
          .05    .025   .01    .005               .05    .025   .01    .005
          Two-tailed test ($\alpha$)              Two-tailed test ($\alpha$)
$\delta$  .10    .05    .02    .01      $\delta$  .10    .05    .02    .01

0.0       .10*   .05*   .02    .01      2.5       .80    .71    .57    .47
0.1       .10*   .05*   .02    .01      2.6       .83    .74    .61    .51
0.2       .11*   .05    .02    .01      2.7       .85    .77    .65    .55
0.3       .12*   .06    .03    .01      2.8       .88    .80    .68    .59
0.4       .13*   .07    .03    .01      2.9       .90    .83    .72    .63
0.5       .14    .08    .03    .02      3.0       .91    .85    .75    .66
0.6       .16    .09    .04    .02      3.1       .93    .87    .78    .70
0.7       .18    .11    .05    .03      3.2       .94    .89    .81    .73
0.8       .21    .13    .06    .04      3.3       .96    .91    .83    .77
0.9       .23    .15    .08    .05      3.4       .96    .93    .86    .80
1.0       .26    .17    .09    .06      3.5       .97    .94    .88    .82
1.1       .30    .20    .11    .07      3.6       .97    .95    .90    .85
1.2       .33    .22    .13    .08      3.7       .98    .96    .92    .87
1.3       .37    .26    .15    .10      3.8       .98    .97    .93    .91
1.4       .40    .29    .18    .12      3.9       .99    .97    .94    .91
1.5       .44    .32    .20    .14      4.0       .99    .98    .95    .92
1.6       .48    .36    .23    .16      4.1       .99    .98    .96    .94
1.7       .52    .40    .27    .19      4.2       .99    .99    .97    .95
1.8       .56    .44    .30    .22      4.3       **     .99    .98    .96
1.9       .60    .48    .33    .25      4.4              .99    .98    .97
2.0       .64    .52    .37    .28      4.5              .99    .99    .97
2.1       .68    .56    .41    .32      4.6              **     .99    .98
2.2       .71    .59    .45    .35      4.7                     .99    .98
2.3       .74    .63    .49    .39      4.8                     .99    .99
2.4       .77    .67    .53    .43      4.9                     .99    .99
                                        5.0                     **     .99
                                        5.1                            .99
                                        5.2                            **

Note. From Introductory Statistics for the Behavioral Sciences (p. 363)
by J. Welkowitz, R. B. Ewen, and J. Cohen, 1982, New York, NY: Academic
Press. Copyright 1982 by Harcourt Brace Jovanovich, Inc. Reprinted by
permission of the publisher.

*  Values inaccurate for one-tailed test by more than .01.

** The power at and below this point is greater than .995.


Table C-2

$\delta$ as a Function of Significance Criterion ($\alpha$) and Power

          One-tailed test ($\alpha$)
          .05     .025    .01     .005
          Two-tailed test ($\alpha$)
Power     .10     .05     .02     .01

.25       0.97    1.29    1.65    1.90
.50       1.64    1.96    2.33    2.58
.60       1.90    2.21    2.58    2.83
.67       2.08    2.39    2.76    3.01
.70       2.17    2.48    2.85    3.10
.75       2.32    2.63    3.00    3.25
.80       2.49    2.80    3.17    3.42
.85       2.68    3.00    3.36    3.61
.90       2.93    3.24    3.61    3.86
.95       3.29    3.60    3.97    4.22
.99       3.97    4.29    4.65    4.90
.999      4.37    5.05    5.42    5.67

Note. From Introductory Statistics for the Behavioral Sciences
(p. 364) by J. Welkowitz, R. B. Ewen, and J. Cohen, 1982,
New York, NY: Academic Press. Copyright 1982 by Harcourt Brace
Jovanovich, Inc. Reprinted by permission of the publisher.